The goal of this file is to document the work done so far during my internship in computational methods (Stage en méthodes computationnelles) under the supervision of Louis Renaud-Desjardins. It mainly explains the methodological choices behind our work and presents the main results we have obtained so far. The explanation of how we fetched the various data is in another file:
C:/Users/jacob/OneDrive - Université Laval/biophilo/fetch_data_2024-11-18.Rmd.
Special thanks to François Claveau, Pierre-Olivier Méthot, Louis Renaud-Desjardins, Thomas Pradeu and Maël Lemoine for their support on this project.
For any questions or issues, please write to jacob.hamel-mottiez.1@ulaval.ca.
# Helper functions to display nice data tables and to add percentages automatically
fct_percent <- function(x) {
  x |>
    mutate(percent = n / sum(n, na.rm = TRUE) * 100) |> # each row's share of the total
    mutate(percent = round(percent, 3))
}

fct_DT <- function(x) {
  DT::datatable(head(x, 1000),
                options = list(scrollX = TRUE,
                               paging = TRUE,
                               pageLength = 5))
}
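As a quick sanity check, here is the percent computation from fct_percent applied to a toy count table (the values below are hypothetical, not from our corpus); the percentages sum to 100:

```r
library(dplyr)
library(tibble)

# Toy count table (hypothetical values, not from our corpus)
toy <- tibble(FROM = c("Scopus", "WoS", "Springer"), n = c(50, 30, 20))

# Same computation as in fct_percent(): each row's share of the total
toy_pct <- toy |>
  mutate(percent = n / sum(n, na.rm = TRUE) * 100) |>
  mutate(percent = round(percent, 3))
```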
Early in our work came a methodological choice, namely, choosing between two well-known databases: Web of Science (WoS) and Scopus. We chose Scopus. In summary, this choice is justified by two main reasons:
Scopus has wider coverage of the journals we want to investigate, that is, the main journals of philosophy of biology. For example, it covers Biological Theory, which is absent from our Web of Science database.
For philosophy, Web of Science is missing many citing-to-cited document links.
If you want in-depth details about both databases and their strengths and weaknesses given our corpus, follow along. If you only want to see the results, you can skip straight to the results section.
Here is how we fetched the information from the different databases.
For Springer, we looked manually at each volume and created an Excel sheet with the number of articles per year.
For Web of Science, we fetched the data through the Albator database, which we accessed via the OST. The detailed information about the SQL query was provided earlier.
For Scopus, we used the Scopus API and the rscopus package to get our data. See the code below for the specific workflow.
To compare the coverage of each database (Web of Science, Scopus and Springer), we perform our tests on the well-known journal Biology & Philosophy, which is present in each database.
bp_db <- read_csv(paste0(dir_od,"bp_article_db.csv"), skip = 1) # This is the data from B&P Springer and not WoS.
bio_philo_papers <- read_csv(paste0(dir_od, "bio_philo_papers.csv"))
bio_philo_affiliations <- read_csv(paste0(dir_od, "bio_philo_affiliations.csv"))
bio_philo_authors <- read_csv(paste0(dir_od, "bio_philo_authors.csv"))
citing_articles <- bio_philo_papers$`dc:identifier` # extracting the IDs of our articles
bio_th_db <- read_csv(paste0(dir_od, "biological_theory_bp_2024-10-7.csv")) # This is the data from Bio. Th. Springer and not Scopus.
Let’s start by making sure that we have good coverage for articles. Unsurprisingly, the articles listed in Springer are more numerous than in the other databases. At first sight, Web of Science seems to have better coverage, with more than 200 additional articles compared to Scopus.
fct_DT(
art_all |>
group_by(FROM) |>
summarise(total_N = sum(N)) |>
arrange(desc(total_N))
)
Let’s look at the distribution of those articles since the beginning of Biology & Philosophy. We see that Web of Science has good coverage except for the early and recent years of B&P, where Scopus does pretty well. However, Scopus seems to lose many articles from the 1995-2005 decade (not shown in the histogram for visibility). Yet when we filter on articles only, Scopus actually gets better coverage overall.
Now that we have a better idea of the articles we are able to get from both databases, let’s look at their references.
For Web of Science, we found that many references had no unique identifier (around 1/3, and potentially up to 1/2).
Here is some quantitative data:
Given that this corpus of 31 439 cited documents doesn’t contain the books and book chapters we are interested in, this is problematic. Compared to this important limitation, Scopus does much better.
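The share of references lacking a unique identifier can be computed directly from the NA count. A minimal sketch on toy data, reusing the Cited_ID column name from our WoS table (the rows below are hypothetical):

```r
library(dplyr)
library(tibble)

# Toy reference table mimicking the WoS structure (hypothetical rows);
# NA in Cited_ID stands for a reference without a unique identifier.
toy_refs <- tibble(
  Citing_ID = c(1, 1, 2, 2, 3, 3),
  Cited_ID  = c(10, NA, 11, NA, 12, 13)
)

# Percentage of references with no unique identifier
missing_share <- toy_refs |>
  summarise(pct_missing = mean(is.na(Cited_ID)) * 100)
```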
We must thank Aurélien Goutsmedt, who made us aware of an API provided by Scopus. It eased our work substantially (for more information, see his blog here).
Here, compared to our WoS data, we get more than 68 800 references with close to a 100% match between references and articles. As a reminder, we had around 64 000 references when going through WoS, and almost 1/3 of them had no link to their respective article (in the histogram, see the Cited_ID comparison).
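The match between references and articles mentioned above can be checked with a membership test on identifiers. A minimal sketch on toy data; Citing_ID follows our reference table, while article_id is a hypothetical name for the article identifier column:

```r
library(dplyr)
library(tibble)

# Hypothetical tables: 3 articles, 4 references; one reference points
# to an article ID absent from the article table.
toy_articles <- tibble(article_id = c("A1", "A2", "A3"))
toy_refs     <- tibble(Citing_ID = c("A1", "A1", "A2", "A9"))

# Percentage of references whose citing article is found in the article table
match_rate <- toy_refs |>
  mutate(matched = Citing_ID %in% toy_articles$article_id) |>
  summarise(pct_matched = mean(matched) * 100)
```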
# Completeness of data -----------------------------------------------------
df <- tibble(ref_bp) |> mutate(across(where(is.character), ~ na_if(., "NULL")))
df <- df |> rename(Citing_ID = OST_BK, Cited_ID = OST_BK_Ref, Cited_Year = Year)
# Create a tibble summarizing total rows, NA values, and non-NA values by column
summary_tibble <- tibble(
  column = names(df),
  total_rows = nrow(df),
  na_count = sapply(df, function(x) sum(is.na(x))),
  non_na_count = sapply(df, function(x) sum(!is.na(x)))
)
summary_tibble <- summary_tibble |>
  mutate(percent = non_na_count / total_rows * 100) |>
  mutate(percent = round(percent, 3)) |>
  arrange(desc(percent))
summary_tibble_WoS <- summary_tibble |>
filter(column != "UID" & column != "UID_Ref")
summary_tibble_WoS$column <- factor(summary_tibble_WoS$column, levels = c("Citing_ID", "Cited_ID","Cited_Author", "Cited_Year", "Cited_Work", "Cited_Title"))
summary_tibble_WoS <- summary_tibble_WoS |>
  mutate(column_adjust = column) |> # this will simplify our work later
  mutate(FROM = "Web of Science")
Another reason to choose Scopus is that it covers more journals in philosophy of biology, such as Biological Theory (BT).
Let’s compare the articles we are able to fetch with the Scopus API to the articles listed on Springer for this journal. We created a .csv by manually counting all the articles listed on Springer for BT and compared it with what we got from the API.
The first step is to compare coverage between Springer and Scopus. As we see, both are pretty close, with Springer listing just under 60 more articles than Scopus.
# Table
fct_DT(
art_all_BT |>
group_by(FROM) |>
summarise(total_N = sum(N)) |>
arrange(desc(total_N))
)
When we look at the specific coverage for each year, we see that it is pretty good. However, 2024 is a strange year in which Scopus’s coverage is better than Springer’s. We don’t understand why at the moment.
As we see, the coverage in Scopus resembles that of Springer when it comes to articles.
Now, let’s look at the references. For the journal Biological Theory, we get 35 793 references in total.
We have almost 100% non-NA entries for a) Cited_Authors, b) Citing_ID, c) Cited_ID and d) Cited_Work, which is either the article name or the book name. We should not be too bothered by the fact that the Cited_Title column has around 40% NA entries, since books typically do not have one.
Something that looks more bothersome is the Cited_Year column, with 40% NA values. Looking into it, we can easily understand why there are so many NAs. The main reason is that Scopus has done some cleaning beforehand, notably for books that have many editions. You can demonstrate this by fetching the data directly from the Scopus website and comparing it to what we get with the API.
Let’s look at an example. Here, we see that the famous book by Richard Dawkins, The Selfish Gene, has been referred to with different publication years (i.e. 1976 and 1989). It is also the case for Odling-Smee et al.’s seminal work Niche Construction: The Neglected Process in Evolution, which gets cited with two different years. If we look at the data from the Scopus API, we see that Dawkins’s book gets no year attributed to it and that many similar but distinct entries fall under the same unique identifier (scopus_id).
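One way to surface these inconsistencies is to count how many distinct publication years appear under each scopus_id. A minimal sketch on hypothetical rows mimicking the Dawkins case:

```r
library(dplyr)
library(tibble)

# Hypothetical rows: the same scopus_id carrying two publication years,
# as happens with multi-edition books like The Selfish Gene.
toy <- tibble(
  scopus_id = c("S1", "S1", "S1", "S2"),
  year      = c(1976, 1989, 1976, 2003)
)

# Identifiers attached to more than one distinct year
inconsistent <- toy |>
  group_by(scopus_id) |>
  summarise(n_years = n_distinct(year)) |>
  filter(n_years > 1)
```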
# REGEX FOR VARIOUS REFERENCES' EXTRACTION --------------------------------
# Define extraction patterns
extract_authors <- paste0(
"^",
"(?:[A-Z]+(?:[-'][A-Z]+)*\\s+)*", # Matches first part of author name allowing hyphens and apostrophes
"(?:[A-Z]+(?:[-'][A-Z]+)*)", # Matches last part of author name allowing hyphens and apostrophes
"\\s+[A-Z](?:\\.[A-Z])*\\.", # Matches initials (e.g., J. or J.A.)
"(?:,\\s+",
"(?:[A-Z]+(?:[-'][A-Z]+)*\\s+)*", # Matches first part of additional author names
"(?:[A-Z]+(?:[-'][A-Z]+)*)", # Matches last part of additional author names
"\\s+[A-Z](?:\\.[A-Z])*\\.", # Matches initials of additional authors
")*"
)
extract_year <- "\\b(\\d{4})\\b"
extract_journal <- "[A-Z][A-Za-z\\s]+(?=\\,\\s\\d)"
extract_volume <- "\\b\\d+\\b(?=\\,|\\s)"
extract_issue <- "(?<=\\,\\s)(?:[A-Z])?\\d+(?=\\,|\\s|\\()|(?<=\\,\\s)[A-Z]\\d+(?=\\,|\\s|\\()"
extract_pages <- "\\bP{0,1}\\.\\s*\\d+(-\\d+)?\\b"
references_extract$references <- toupper(references_extract$references)
extraction <- function(ref) {
# Extract components
year <- str_extract(ref, extract_year)
authors <- str_extract(ref, extract_authors)
journal <- str_extract(ref, extract_journal)
pages <- str_extract(ref, extract_pages)
# Extract volume and issue separately
volume_issue <- str_extract(ref, "\\b\\d{1,4}\\b(,\\s*\\d{1,4})?")
# Split into volume and issue if both are present
if (!is.na(volume_issue)) {
volume_issue_split <- str_split(volume_issue, ",\\s*")[[1]]
volume <- volume_issue_split[1] # First part is the volume
issue <- ifelse(length(volume_issue_split) > 1, volume_issue_split[2], NA) # Second part is the issue, if it exists
} else {
volume <- NA
issue <- NA
}
# Clean up formats
year <- str_trim(year)
pages <- ifelse(!is.na(pages), str_extract(pages, "\\d+(-\\d+)?"), NA)
# Create a vector of extracted components
extracted_parts <- c(authors, year, journal, volume, issue, pages)
# Remove extracted parts and clean the remaining reference
remaining_ref <- ref %>%
str_remove_all(paste0(extracted_parts, collapse = "|")) %>%
str_remove_all(",\\s*") %>%
str_remove_all("\\s*\\(.*?\\)\\s*") %>%
str_remove_all("P\\.\\s*|PP\\.\\s*") %>%
str_remove_all("^\\s*|\\s*$") %>%
str_trim()
tibble(
extracted_year = year,
extracted_authors = authors,
unique_author = authors, # This extra column is to get all unique author for later count.
extracted_journal = journal,
extracted_volume = volume,
extracted_issue = issue,
extracted_pages = pages,
remaining_ref = remaining_ref
)
}
# Apply the function and handle nested results
results <- references_extract %>%
  mutate(
    extraction_results = map(references, extraction) # Apply function to each reference
  ) %>%
  unnest_wider(extraction_results) # Unnest the tibble returned by `extraction`
# View the results
fct_DT(results)
results_split <- results %>%
separate_rows(unique_author, sep = ",\\s*") # Split authors into multiple rows
fct_DT(results_split)
# SAVE RESULTS ------------------------------------------------------------
write_csv(results, paste0(dir_od, "cleaned_ref.csv"))
write_csv(results_split, paste0(dir_od, "cleaned_ref_split.csv"))
count_ref_art <- results |>
filter(!is.na(extracted_authors)) |>
select(extracted_authors, extracted_year, remaining_ref) |>
add_count(remaining_ref, extracted_authors) |>
unique() |>
arrange(desc(n))
fct_DT(count_ref_art)
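To see what the extraction patterns above capture, here is a self-contained demonstration on a single toy reference string, using simplified copies of the year and pages patterns (the reference itself is made up):

```r
library(stringr)

# Toy reference, uppercased as in our pipeline
ref <- toupper("Dawkins R., The Selfish Gene, 1976, P. 224")

# Simplified copies of two of the patterns defined above
year  <- str_extract(ref, "\\b(\\d{4})\\b")               # publication year
pages <- str_extract(ref, "\\bP{0,1}\\.\\s*\\d+(-\\d+)?\\b") # page fragment
pages <- str_extract(pages, "\\d+(-\\d+)?")               # keep digits only
```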
Now that we have checked that the coverage we get for both articles and their references is satisfactory, we can dig into some of the results. Let’s start with the articles.
clean_references_bp <- clean_references_bp |> mutate(year = as.Date(year)) |> mutate(year = year(year)) #To get only the year.
most_c_ref_bp <- clean_references_bp |> filter(!is.na(sourcetitle)) |>
select(scopus_id, author, year, sourcetitle, title, type) |> add_count(sourcetitle, title, author, scopus_id) |> arrange(desc(n)) |> distinct()
most_c_ref_bp <- fct_percent(most_c_ref_bp)
fct_DT(most_c_ref_bp |> select(-scopus_id, -type))
clean_references_th <- clean_references_th |> mutate(year = as.Date(year)) |> mutate(year = year(year))
most_c_ref <- clean_references_th |> filter(!is.na(sourcetitle)) |> select(scopus_id, author, year, sourcetitle, title, type) |> add_count(sourcetitle, title, author, scopus_id) |> arrange(desc(n)) |> distinct()
most_c_ref_th <- fct_percent(most_c_ref)
fct_DT(most_c_ref_th |> select(-scopus_id, -type))
# BIOLOGY & PHILOSOPHY
clean_references_bp <- clean_references_bp |> add_count(scopus_id, author, sourcetitle, title, author_list_author_ce_initials, year, name = "most_n")
clean_references_bp <- clean_references_bp |>
distinct() |>
filter(!is.na(scopus_id))
clean_references_bp <- clean_references_bp |>
group_by(scopus_id) |>
filter(most_n == max(most_n)) |>
slice_head(n = 1) |>
ungroup()
rank_bp <- clean_references_bp |> mutate(rank_in_bp = dense_rank(desc(n))) |> arrange(desc(n))
# BIOLOGICAL THEORY
clean_references_th <- clean_references_th |> add_count(scopus_id, author, sourcetitle, title, author_list_author_ce_initials, year, name = "most_n")
clean_references_th <- clean_references_th |>
distinct() |>
filter(!is.na(scopus_id))
clean_references_th <- clean_references_th |>
group_by(scopus_id) |>
filter(most_n == max(most_n)) |>
slice_head(n = 1) |>
ungroup()
rank_th <- clean_references_th |> mutate(rank_in_th = dense_rank(desc(n))) |> arrange(desc(n))
cited_authors_tbl <- full_join(
rank_th |> select(scopus_id, author, year, sourcetitle, title, rank_in_th),
rank_bp |> select(scopus_id, rank_in_bp),
by = "scopus_id")
fct_DT(cited_authors_tbl |> select(-scopus_id))
cited_journals_bp <- clean_references_bp |> select(scopus_id, sourcetitle, title) |> filter(!is.na(title)) |> count(sourcetitle) |> arrange(desc(n))
cited_journals_bp <- fct_percent(cited_journals_bp)
fct_DT(cited_journals_bp)
write_csv(cited_journals_bp, paste0(dir_od, "cited_journals_bp_2024-12-07.csv"))
cited_journals_th <- clean_references_th |> select(scopus_id, sourcetitle, title) |> filter(!is.na(title)) |> count(sourcetitle) |> arrange(desc(n))
cited_journals_th <- fct_percent(cited_journals_th)
fct_DT(cited_journals_th)
journal_rank_bp <- cited_journals_bp |> mutate(rank_in_bp = dense_rank(desc(n))) |> arrange(desc(n))
journal_rank_th <- cited_journals_th |> mutate(rank_in_th = dense_rank(desc(n))) |> arrange(desc(n))
journal_rank_all <- full_join(journal_rank_bp |> select(sourcetitle, rank_in_bp),
                              journal_rank_th |> select(sourcetitle, rank_in_th),
                              by = "sourcetitle")
fct_DT(journal_rank_all)
keyword_bp <- bio_philo_papers |> select(citing_art, dc_creator, year, authkeywords, prism_publication_name)
keyword_bp_cleaned <- keyword_bp |>
separate_rows(authkeywords, sep = " \\| ") |>
filter(!is.na(authkeywords))
keyword_bp_cleaned$authkeywords <- toupper(keyword_bp_cleaned$authkeywords)
# KEYWORD PLUS WORD CLOUD
keyword_bp_count <- keyword_bp_cleaned |>
select(authkeywords) |> count(authkeywords, sort=TRUE)
keyword_bp_count <- fct_percent(keyword_bp_count)
fct_DT(keyword_bp_count)
keyword_th <- bio_th_papers |> select(citing_art, dc_creator, year, authkeywords, prism_publication_name)
keyword_th_cleaned <- keyword_th |>
separate_rows(authkeywords, sep = " \\| ") |>
filter(!is.na(authkeywords))
keyword_th_cleaned$authkeywords <- toupper(keyword_th_cleaned$authkeywords)
# KEYWORD PLUS WORD CLOUD
keyword_th_count <- keyword_th_cleaned |>
  select(authkeywords) |> count(authkeywords, sort = TRUE)
keyword_th_count <- fct_percent(keyword_th_count)
fct_DT(keyword_th_count)
Now that we have these tables, it may be useful to visualize those keywords and their importance.
An important thing to note is that we don’t have access to the keywords of the references provided by Scopus.
Here, we compute what we call the citation delay: for each cited reference, the difference between the citing article’s publication year and the reference’s publication year. Here is the cumulative distribution function showing the evolution of this citation delay as the journal gets older.
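As a concrete illustration of the definition (with made-up years): an article published in 2010 citing works from 1990 and 2005 yields delays of 20 and 5 years.

```r
library(dplyr)
library(tibble)

# Hypothetical article published in 2010 citing works from 1990 and 2005
toy <- tibble(
  citing_year = c(2010, 2010),
  cited_year  = c(1990, 2005)
)

# Citation delay: citing year minus cited year, per reference
toy <- toy |> mutate(delay = citing_year - cited_year)
```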
# BIOLOGY & PHILOSOPHY
bio_philo_papers <- bio_philo_papers |> mutate(date = as.Date(prism_cover_date)) |>
mutate(year = year(date))
delay_refs_bp <- clean_references_bp |>
rename(cited_year = year) |>
left_join(bio_philo_papers |> select(citing_art, year),
by = "citing_art") |>
arrange(desc(citing_art))
delay_refs_bp <- delay_refs_bp |> mutate(delay = year-cited_year) |> mutate(from = "B&P")
delay_refs_bp$decade <- cut(delay_refs_bp$year,
                            breaks = c(2006, 2014, 2023),
                            labels = c("2006-2013", "2014-2021"),
                            right = FALSE) # left-inclusive intervals
p1 <- ggplot(delay_refs_bp |> filter(!is.na(decade)), aes(x = delay, color = decade, group = decade)) +
stat_ecdf(geom = "step", show.legend = FALSE) +
labs(title = "Citation Delay Biology & Philosophy (1987-2022)",
x = "Delay",
y = "CDF") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_x_continuous(limits = c(0, 50))
# BIOLOGICAL THEORY
delay_refs_th <- clean_references_th |>
rename(cited_year = year) |>
left_join(bio_th_papers |>
select(citing_art, year), by = "citing_art") |>
arrange(desc(citing_art))
delay_refs_th <- delay_refs_th |> mutate(delay = year-cited_year) |> mutate(from = "BT")
delay_refs_th$decade <- cut(delay_refs_th$year,
breaks = c(2006, 2014, 2023),
labels = c("2006-2013", "2014-2021"),
right = FALSE) # left-inclusive
p2 <- ggplot(delay_refs_th |> filter(!is.na(decade)), aes(x = delay, color = decade, group = decade)) +
stat_ecdf(geom = "step", show.legend = FALSE) +
labs(title = "Citation Delay Biological Theory (2006-2022)",
x = "Delay",
y = "CDF") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_x_continuous(limits = c(0, 50))
# BOTH B&P AND BT
all <- rbind(delay_refs_bp, delay_refs_th) |> filter(!is.na(decade))
p3 <- ggplot(all, aes(x = delay, color = decade, group = decade)) +
stat_ecdf(geom = "step") +
labs(
x = "Delay",
y = "CDF") +
theme(plot.title = element_text(hjust = 0.5),
      legend.position = "top", legend.title = element_blank()) +
scale_x_continuous(limits = c(0, 50)) +
facet_grid(rows = ~ from)
ggplotly(p3) |> layout(legend = list(title = FALSE, orientation = "h",
xanchor = "center",
x = 0.5, y = 1.2))
While this shift is interesting, we need to be careful. It could simply be that, as time goes on, we keep citing old works, creating an artificial rightward shift that is not really problematic. Let’s look at the distribution of citation delays.
p4 <- all |> ggplot(aes(x = delay, group = from, fill = decade, color = decade)) +
geom_density(alpha = 0.5) +
facet_grid(cols = vars(from), rows = vars(decade))
ggplotly(p4) |> layout(legend = list(title = FALSE, orientation = "h", # show entries horizontally
xanchor = "center", # use center of legend as anchor
x = 0.5, y = 1.2))